Abstract 

Background: Biological networks provide fundamental insights into the functional characterization of genes and 
their products, the characterization of DNA-protein interactions, the identification of regulatory mechanisms, and 
other biological tasks. Due to the experimental and biological complexity, their computational exploitation faces 
many algorithmic challenges. 

Results: We introduce novel weighted quasi-biclique problems to identify functional modules in biological 
networks when represented by bipartite graphs. In contrast to previous quasi-biclique problems, we include 
biological interaction levels by using edge-weighted quasi-bicliques. While we prove that our problems are 
NP-hard, we also describe IP formulations to compute exact solutions for moderately sized networks. 

Conclusions: We verify the effectiveness of our IP solutions using both simulation and empirical data. The 
simulation shows high quasi-biclique recall rates, and the empirical data corroborate the abilities of our weighted 
quasi-bicliques in extracting features and recovering missing interactions from biological networks. 



Introduction 

Cellular processes such as transcription, replication, meta- 
bolic catalyses, or the transport of substances are carried 
out by molecules that are associated in functional mod- 
ules, and are often realized as physical interaction within 
protein complexes. These physical interactions form mole- 
cular networks. Analyzing these networks is a thriving 
field (e.g. [1]) and has extensive implications for a host of 
issues in biology, pharmacology [1], and medicine [2]. 
Capturing the modularities of molecular networks
accurately will yield insights into cellular processes and
gene function. Yet, before such modularities can be reliably
inferred, challenging computational problems have to be
overcome. 

These computational problems typically result from 
incomplete and error-prone networks that largely obfus- 
cate the reliable identification of modules [3,4]. Often, 
molecular interactions cannot be measured to the
accuracy of the genome sequences, leaving some guesswork
in identifying modularities correctly. Some molecular
interactions are highly transient and can only be
measured indirectly, while others withstand denaturing
agents. Functional interaction does not even have to be
realized via physical interactions. Thus, computational
methods for capturing modularity cannot directly rely
on the presence or absence of interactions in molecular
networks and need to be able to cope with substantial
error rates. 

Unweighted quasi-biclique approaches have been used 
in the past to identify modularity in protein interaction 
networks when presented as bipartite graphs that are 
spanned between different features of proteins, e.g. bind- 
ing sites and domain content function [3,5]. An example 
is depicted in Figure 1. While these approaches aim to 
solve NP-hard problems using heuristics, they were able 
to identify some highly interactive protein complexes 
[6,7]. 

Figure 1 An example of a quasi-biclique. A quasi-biclique (darker nodes and solid edges) identified from a gene interaction network in one 
of our experiment sets where the edge weights are interaction scores. The bipartite graph is unweighted if only the existence of edges is 
considered. 

Unweighted quasi-biclique approaches are sensitive to
the quantitative uncertainties intrinsic to molecular
networks. Interactions are only represented by an unweighted
edge in the bipartite graph if they are above some
user-specified threshold. Therefore, unweighted quasi-biclique
approaches are prone to disregard many of the invaluable
interactions that are below the threshold, and treat all
interactions above the threshold the same. Further, some
interactions may or may not be represented due to some
seemingly insignificant error in the measurement.
Consequently, many crucial modules may be concealed and
remain undetected by using unweighted quasi-biclique
approaches. 

Here we introduce novel weighted quasi-biclique problems
by using bipartite graphs where edges are
weighted by the level of the corresponding interactions
(e.g., Figure 1). We show that these problems are, similar
to their unweighted versions, NP-hard. However, in
practice, exact Integer Programming (IP) formulations
can efficiently tackle many NP-hard real-world problems
[8]. Therefore, we describe exact IP solutions for our
weighted quasi-biclique problems.
Furthermore, our IP solutions exploit the sparseness
of molecular networks when represented as bipartite
graphs. This allows us to verify the ability of our IP
solutions using a moderately sized genetic network [9] and
simulation studies. In addition, our IP solutions can
provide exact results for instances of the unweighted
quasi-biclique approaches that were previously not available. 

Related work 

Maximal bicliques in biological networks are
self-contained elements characterizing functional modules.
In protein interaction networks they manifest as
interactive protein complexes (e.g., [3,7]). Bipartite graphs are 
graphs whose vertices can be bi-partitioned into sets X 
and Y such that each edge is incident to vertices in X 
and Y. A biclique is a subgraph of a bipartite graph 
where every vertex in one partition is connected to 
every vertex in the other partition by an edge. A biclique 
is maximal if it is not properly contained in any other 
biclique, and it is maximum if no other maximal bicli- 
ques have larger total edge weights. The problem of 
finding maximum bicliques is well studied in the litera- 
ture of graph theory and is known to be NP-complete 
[10] and effective heuristics for this problem have been 
described and used in various applications [11]. How- 
ever, bicliques are too stringent for identifying modules 
in real world networks [12]. For example a module is 
not identified through a biclique that is incomplete by 
one single edge. Quasi-bicliques are partially incomplete 
bicliques that overcome this limitation. They allow a 
specified maximum number of edges to be missed in 
order to form a biclique [13]. While quasi-bicliques are
less stringent for the identification of modules, they
might contain genes that interact with only a few
or none of the other genes. Such situations occur when
the missing edges are not homogeneously distributed
throughout. The δ-quasi-bicliques (δ-QB) [14] allow
control over the distribution of missing edges by setting
lower bounds, parametrized by δ, on the minimum
number of edges incident to the vertices in each of the
vertex sets in a δ-QB. 

Our contributions 

Here we define a "weighted" version of δ-QB, called
α,β-weighted quasi-bicliques (α,β-WQB), to improve on the
identification of modules in molecular networks by
using the interaction levels between genes. Thus,
α,β-WQB's may be better suited to handling noisy data
sets, as they distribute the overall missing information
across the vertices of the quasi-biclique, as shown in our
simulations. We define two versions of α,β-WQB's, each
in terms of the amount of weight the quasi-biclique is
allowed to miss. The two different versions of weighted
quasi-bicliques provide flexibility in choosing the
missing weight. For the first version, called the percentage
version, we define the missing weight in terms of a
percentage of the number of vertices in the quasi-biclique,
while for the second version, called the constant
version, the missing weight is defined as a constant. The
need for constant version weighted quasi-bicliques arises
from the fact that, for certain applications, the weight
allowed to be missed does not depend on any of the
graph parameters and is a constant. Finding a maximum
α,β-WQB in a given edge-weighted bipartite graph is
NP-hard, since it is a generalization of the NP-hard
problem of finding δ-QBs in unweighted bipartite graphs [3].
We also introduce a "query" version of the maximum
α,β-WQB problem that allows biologists to focus their
analyses on genes of particular interest. Given a
network and specific genes from this network, called the
query, the query problem is to find a maximum
weighted α,β-WQB that includes the query. We prove
this problem to be NP-hard. While the maximum
α,β-WQB problem and its query version are NP-hard, we
provide exact IP solutions to solve both problems. By
reducing the number of required variables and exploiting
the sparseness of bipartite graphs representing molecular
networks, our solutions solve moderate-sized
instances. This allows us to verify the applicability of
α,β-WQB by analyzing the most complete data set of
genetic interactions available for the eukaryotic model
organism Saccharomyces cerevisiae. Our results not only
extract meaningful yet unexpected quasi-bicliques under
functional classes, but also suggest higher possibilities of
recovering missing interactions not presented in the
input. A preliminary version of this work appeared in
ISBRA 2011 [15]. In this paper, we extend the usage of
the parameters α and β such that the edge weight
threshold can be either a ratio of the α,β-WQB size or a
constant. The time complexity and the application of
this extended α,β-WQB are both discussed. 

Results and discussion 

Before analyzing our findings in biological networks, we
first introduce formal definitions of weighted
quasi-bicliques (WQB) and then discuss the results of
applying the WQB as a data mining tool. 

Preliminaries 

A bipartite graph, denoted by $(U + V, E)$, is a graph
whose vertex set can be partitioned into the sets $U$ and $V$
such that its edge set $E$ consists only of edges $\{u, v\}$
where $u \in U$ and $v \in V$ ($U$ and $V$ are independent sets).
Let $G := (U + V, E)$ be a bipartite graph. The graph $G$ is
called complete if for any two vertices $u \in U$ and $v \in V$
there is an edge $\{u, v\} \in E$. A biclique in $G$ is a pair
$(U', V')$ that induces a complete bipartite subgraph in $G$,
where $U' \subseteq U$ and $V' \subseteq V$. Since any subgraph induced by
a biclique is a complete bipartite graph, we use the two
terms interchangeably. A pair $(U, V)$ includes another
pair $(U', V')$ if $U' \subseteq U$ and $V' \subseteq V$. In such a case, we also
say that the pair $(U', V')$ is included in $(U, V)$. A pair
$(U', V')$ is non-empty if both $U'$ and $V'$ are non-empty. A
weighted bipartite graph, denoted by $(U + V, E, \omega)$, is a
complete bipartite graph $(U + V, E)$ with a weight
function $\omega : E \to [0, 1]$. 

Maximum weighted quasi-biclique (α,β-WQB) problem 

Definition 1 (α,β-WQB_P). Let $G := (U + V, E, \omega)$ and
$\alpha, \beta \in [0, 1]$. A percentage version α,β-weighted quasi-biclique,
denoted as α,β-WQB_P, in $G$ is a non-empty pair $(U', V')$
that is included in $(U, V)$ and satisfies the two properties: 

(1) $\forall u \in U' : \sum_{v \in V'} \omega(u, v) \ge \alpha |V'|$, and
(2) $\forall v \in V' : \sum_{u \in U'} \omega(u, v) \ge \beta |U'|$. 

Definition 2 (α,β-WQB_C). Let $G := (U + V, E, \omega)$
and $\alpha, \beta \in [0, \infty)$. A constant version α,β-weighted
quasi-biclique, denoted as α,β-WQB_C, in $G$ is a non-empty
pair $(U', V')$ that is included in $(U, V)$ and satisfies
the two properties: 

(1) $\forall u \in U' : \sum_{v \in V'} \omega(u, v) \ge |V'| - \alpha$, and
(2) $\forall v \in V' : \sum_{u \in U'} \omega(u, v) \ge |U'| - \beta$. 

In either version, the weight of an α,β-WQB is defined
as the sum of all its edge weights. 
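
The two definitions can be checked mechanically for a candidate pair. The following is a minimal sketch, assuming the weight function is stored as a Python dictionary keyed by (u, v) pairs; the function names `is_wqb` and `wqb_weight` and the toy vertex labels are illustrative and do not come from the original work.

```python
def is_wqb(weight, U_sub, V_sub, alpha, beta, version="percentage"):
    """Check Definition 1 ("percentage") or Definition 2 ("constant").

    weight maps (u, v) pairs to values in [0, 1]; absent pairs count as 0
    (an assumption of this sketch).
    """
    if not U_sub or not V_sub:
        return False  # the pair (U', V') must be non-empty
    if version == "percentage":
        row_bound, col_bound = alpha * len(V_sub), beta * len(U_sub)
    elif version == "constant":
        row_bound, col_bound = len(V_sub) - alpha, len(U_sub) - beta
    else:
        raise ValueError("version must be 'percentage' or 'constant'")
    # Property (1): every u in U' carries enough weight towards V'.
    rows_ok = all(sum(weight.get((u, v), 0.0) for v in V_sub) >= row_bound
                  for u in U_sub)
    # Property (2): every v in V' carries enough weight towards U'.
    cols_ok = all(sum(weight.get((u, v), 0.0) for u in U_sub) >= col_bound
                  for v in V_sub)
    return rows_ok and cols_ok


def wqb_weight(weight, U_sub, V_sub):
    """Weight of a quasi-biclique: the sum of all its edge weights."""
    return sum(weight.get((u, v), 0.0) for u in U_sub for v in V_sub)


# Toy example with hypothetical vertex labels.
weight = {("u1", "v1"): 1.0, ("u1", "v2"): 0.5,
          ("u2", "v1"): 0.5, ("u2", "v2"): 1.0}
U_sub, V_sub = {"u1", "u2"}, {"v1", "v2"}
print(is_wqb(weight, U_sub, V_sub, 0.7, 0.7))              # sums of 1.5 vs bound 1.4
print(is_wqb(weight, U_sub, V_sub, 0.5, 0.5, "constant"))  # sums of 1.5 vs bound 2 - 0.5
print(wqb_weight(weight, U_sub, V_sub))
```

In the toy example every row and column sums to 1.5, so the pair qualifies for the percentage version up to α = β = 0.75 and for the constant version from α = β = 0.5 upwards.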

Definition 3 (Maximum α,β-WQB_P(C)). An α,β-WQB_P(C)
is a maximum α,β-WQB_P(C) of a weighted bipartite
graph $G := (U + V, E, \omega)$ if its weight is at least as
much as the weight of any other α,β-WQB_P(C) in $G$ for
given values of α and β. 

Problem 1 (α,β-WQB_P(C)). 

Instance: A weighted bipartite graph $G := (U + V, E, \omega)$,
and values $\alpha, \beta \in [0, 1]$ ($[0, \infty)$). 

Find: A maximum weighted α,β-WQB_P(C) in $G$. 

Note that we use the same notation (α,β-WQB_P(C))
for the α,β-weighted quasi-biclique and the maximum
α,β-weighted quasi-biclique problem of either version.
The context in which we use the notation will make the
difference clear. Also, when we just say α,β-WQB, the
context will make clear if we are referring to the percentage
version, the constant version, or both. 

Query problem 

A common requirement in the analysis of networks is to
provide the environment of a certain group of genes,
which translates into finding the maximum weighted
α,β-WQB_P(C) which includes a specific set of vertices. We
call this the query problem; it is defined as follows. 

Problem 2 (Query_P(C)). 

Instance: A weighted bipartite graph $G := (U + V, E, \omega)$,
values $\alpha, \beta \in [0, 1]$ ($[0, \infty)$), and a pair $(P, Q)$
included in $(U, V)$. 

Find: The α,β-WQB_P(C) which includes $(P, Q)$ and has
a weight greater than or equal to the weight of any
α,β-WQB_P(C) which includes $(P, Q)$. 
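
For very small graphs, the maximum and query problems can be illustrated by exhaustive enumeration. The sketch below is only a didactic brute force for the percentage version (exponential in |U| + |V|) and is not the IP approach developed in this paper; the function names and the toy graph are assumptions made for illustration.

```python
from itertools import combinations


def subsets_containing(items, required):
    """Yield every subset of items that contains all elements of required."""
    optional = [x for x in items if x not in required]
    for r in range(len(optional) + 1):
        for extra in combinations(optional, r):
            yield frozenset(required) | frozenset(extra)


def brute_force_query(weight, U, V, alpha, beta, P=(), Q=()):
    """Return (weight, U', V') of a maximum percentage-version WQB that
    includes the query (P, Q), or None if no such quasi-biclique exists."""
    best = None
    for U_sub in subsets_containing(U, P):
        for V_sub in subsets_containing(V, Q):
            if not U_sub or not V_sub:
                continue  # quasi-bicliques are non-empty pairs
            rows_ok = all(sum(weight[u, v] for v in V_sub) >= alpha * len(V_sub)
                          for u in U_sub)
            cols_ok = all(sum(weight[u, v] for u in U_sub) >= beta * len(U_sub)
                          for v in V_sub)
            if rows_ok and cols_ok:
                total = sum(weight[u, v] for u in U_sub for v in V_sub)
                if best is None or total > best[0]:
                    best = (total, U_sub, V_sub)
    return best


# Toy complete weighted bipartite graph (hypothetical labels and weights).
U, V = ["a", "b", "c"], ["x", "y"]
weight = {("a", "x"): 1.0, ("a", "y"): 1.0,
          ("b", "x"): 1.0, ("b", "y"): 0.5,
          ("c", "x"): 0.0, ("c", "y"): 0.0}

print(brute_force_query(weight, U, V, 0.75, 0.75))            # unrestricted maximum
print(brute_force_query(weight, U, V, 0.75, 0.75, P=("c",)))  # query forcing vertex c
```

Setting P and Q to empty tuples recovers Problem 1, which reflects that the query problem generalizes the maximum α,β-WQB problem.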

Experiment results 

Finding appropriate values for α and β is a critical part
of an application. The α and β values allow the user to
custom-tailor the search based on the weight distribution
and the expected findings of the particular application.
Typically, quasi-bicliques of different α and β
values have to be analyzed by domain experts in
order to optimize the findings. We use simulated data
sets to explore the problem of finding the right α and β
values. We then use our IP model to explore
α,β-WQB's in a real-world application, a recent data set of
functional groups formed in genetic interactions. The
filtered data set, compared to the raw data, served to
investigate the role of non-existing edges in the input
bipartite graph. While mathematically equivalent in the
modeling step, a non-edge in an experimentally generated
network represents either a true non-interaction or a
false negative. Assuming the input consists of meaningful
features, our preliminary results show that
α,β-WQB's may recover missing edges with potentially
higher weights better than δ-quasi-bicliques. 

Simulations 

As part of the simulation studies, we try to retrieve a known
maximum weighted quasi-biclique from a weighted bipartite
graph using both versions of α,β-WQB's. In each
simulation experiment we do the following. The pair $(U, V)$
represents the vertices of a weighted bipartite graph $G$.
We randomly choose $U' \subseteq U$ and $V' \subseteq V$ as vertices of the
known quasi-biclique in $G$. The sizes of $U'$ and $V'$ are
set to be the same; this size is picked randomly, but is
limited to a specific percentage of the total vertices on
each side. Random edges between the vertices of $U'$ and $V'$
in $G$ are introduced according to a pre-determined edge
density d. The edges between vertices of $U \setminus U'$ and
$V \setminus V'$ of $G$ are also generated randomly according to a
pre-determined density d'. The edge weights of the known
quasi-biclique $(U', V')$ are determined by a Gaussian
distribution with a mean mn and standard deviation dev.
Weights of the edges of $G$ not present in the quasi-biclique
are also determined by a Gaussian distribution with a lower
mean mn' and standard deviation dev'. We now retrieve the
maximum weighted α,β-WQB_P and α,β-WQB_C from $G$ by using
specific values of α and β calculated as described below. 
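
The generation procedure can be sketched as follows. This is a minimal illustration rather than the script used for the reported experiments. It makes three assumptions that the text does not state: weights are clipped to [0, 1], absent edges get weight 0, and every vertex pair outside U' × V' is treated as background with density d' and mean mn'. The function and parameter names are hypothetical.

```python
import random


def simulate_graph(n_u, n_v, k, d, d_prime, mn, dev, mn_prime, dev_prime, seed=0):
    """Plant a k x k quasi-biclique (U', V') in an n_u x n_v weighted graph."""
    rng = random.Random(seed)
    U, V = list(range(n_u)), list(range(n_v))
    U_sub = set(rng.sample(U, k))  # planted U'
    V_sub = set(rng.sample(V, k))  # planted V'
    weight = {}
    for u in U:
        for v in V:
            inside = u in U_sub and v in V_sub
            density, mean, sd = (d, mn, dev) if inside else (d_prime, mn_prime, dev_prime)
            if rng.random() < density:
                # Assumption: clip the Gaussian draw so that weights stay in [0, 1].
                weight[u, v] = min(1.0, max(0.0, rng.gauss(mean, sd)))
            else:
                weight[u, v] = 0.0  # no edge generated for this pair
    return weight, U_sub, V_sub


# A 16 x 16 graph with a planted 4 x 4 quasi-biclique; mn and mn' are example values.
weight, U_sub, V_sub = simulate_graph(16, 16, 4, 0.8, 0.2, 0.8, 0.1, 0.3, 0.1, seed=1)
print(len(weight), sorted(U_sub), sorted(V_sub))
```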

For retrieving α,β-WQB_P, the values α and β are chosen
in two different ways. As part of the simulation we
evaluate the performance of both methods. The first
method sets both α and β to the mean of the weights of
the edges of the quasi-biclique. In the second method, α
and β are calculated as given below: 

$$\alpha = \min\Big\{ C_{u'} \;\Big|\; C_{u'} = \Big(\sum_{v' \in V'} \omega(u', v')\Big) / |V'| \text{ for all } u' \in U' \Big\}$$

$$\beta = \min\Big\{ C_{v'} \;\Big|\; C_{v'} = \Big(\sum_{u' \in U'} \omega(u', v')\Big) / |U'| \text{ for all } v' \in V' \Big\}$$

Similarly, for retrieving α,β-WQB_C, the values α and β
are calculated as given below. 

$$\alpha = |V'| - \min\Big\{ C_{u'} \;\Big|\; C_{u'} = \sum_{v' \in V'} \omega(u', v') \text{ for all } u' \in U' \Big\}$$

$$\beta = |U'| - \min\Big\{ C_{v'} \;\Big|\; C_{v'} = \sum_{u' \in U'} \omega(u', v') \text{ for all } v' \in V' \Big\}$$
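
These parameter choices make the planted quasi-biclique exactly feasible: the vertex with the smallest weight sum meets its bound with equality. A minimal sketch of the second percentage-version method and of the constant-version method follows, assuming a dictionary-based weight function; the function name and the toy data are illustrative.

```python
def derive_parameters(weight, U_sub, V_sub):
    """Derive alpha and beta from a known quasi-biclique (U', V')."""
    row_sums = [sum(weight[u, v] for v in V_sub) for u in U_sub]  # one sum per u' in U'
    col_sums = [sum(weight[u, v] for u in U_sub) for v in V_sub]  # one sum per v' in V'
    return {
        # Percentage version: smallest average weight per vertex.
        "percentage": (min(row_sums) / len(V_sub), min(col_sums) / len(U_sub)),
        # Constant version: largest amount of weight missing at any vertex.
        "constant": (len(V_sub) - min(row_sums), len(U_sub) - min(col_sums)),
    }


# Toy planted quasi-biclique (hypothetical labels and weights).
weight = {("a", "x"): 1.0, ("a", "y"): 0.5,
          ("b", "x"): 1.0, ("b", "y"): 1.0}
params = derive_parameters(weight, ["a", "b"], ["x", "y"])
print(params)
```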

The ILP models of the corresponding α,β-WQB problems
are generated in Python, and solved in Gurobi 4
[16] on a PC with an Intel Core2 Quad 2.4 GHz CPU
with 8 GB memory. 

For the evaluation, let $(U'', V'')$ represent the maximum
weighted α,β-WQB returned by the ILP model.
The percentage of the vertices of $U'$ in $U''$ is called the
recall of $U'$. Similarly, the percentage of vertices of $V'$ in
$V''$ is called the recall of $V'$. The recalls of $U'$ and $V'$ are
our evaluation criteria. For each graph size, experiments
were run by varying the values mn and mn'. The
values dev and dev' were set to 0.1. The densities d and d'
were set to 0.8 and 0.2. The experiments were run for
graphs of size 16 x 16, 32 x 32 and 40 x 40. Each
experiment was repeated three times and the average number
of recalled vertices was calculated. The recall of the
experiments can be seen in Table 1. For each graph size
the first two columns represent the recall values for the
percentage version α,β-WQB's and the third column
represents the recall value for the constant version α,β-WQB. As
the difference between the means increases, so does the
average recall. The second method of choosing α and β
for the percentage version α,β-WQB yields a consistently
higher recall. 
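
As a small illustration of the evaluation criterion, recall can be computed as below. The helper name `recall` and the example sets are hypothetical and are not taken from the experiments.

```python
def recall(planted, returned):
    """Percentage of the planted vertices that appear in the returned set."""
    planted = set(planted)
    return 100.0 * len(planted & set(returned)) / len(planted)


# Recall of U' for three hypothetical repetitions, then their average.
planted_U = {1, 2, 3, 4}
runs = [{1, 2, 3, 4}, {2, 3, 4, 9}, {1, 2, 7, 8}]
recalls = [recall(planted_U, r) for r in runs]
print(recalls, sum(recalls) / len(recalls))
```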

Genetic interaction networks 

A comprehensive set of genetic interactions and functional
annotations published recently by Costanzo et al.
[9] is amongst the best single data sources for weighted
biological networks. The aim of our application is to
identify the maximum weighted quasi-bicliques consisting
of genes in different functional classes in the
Costanzo dataset. 

Table 1 Simulation results of α,β-WQB recall 
Recall of vertices in the simulation. For every experiment, the value in the A U (A V) column represents the average recall percentage of U' (V'). The results are 
different from the preliminary version due to randomness in the simulation experiments. 



Pairwise comparisons of the total 18 functional classes
provide 153 sets. For every distinct pair $(A, B)$ of such
classes, we build a weighted bipartite graph $(U_A + V_B, E, \omega)$
where genes from functional class $A$ are represented as
vertices in $U_A$ and genes from functional class $B$ are
represented as vertices in $V_B$. 

The absolute values of the interaction score ε are
used as the edge weights. Values greater than 1 are
rounded off to 1. Any gene present in both the
functional classes $A$ and $B$ is represented as different vertices
in the partitions $U_A$ and $V_B$ and the edge between those
vertices is given a weight of 1. We build LP models of
both α,β-WQB versions for the bipartite graphs to
identify the maximum weighted quasi-bicliques. 
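
The graph construction for one pair of functional classes can be sketched as follows. The input format (a dictionary from unordered gene pairs to interaction scores), the gene names, and the choice to give unmeasured pairs a weight of 0 are assumptions of this sketch, not details taken from the Costanzo data files.

```python
from itertools import combinations


def build_weighted_graph(scores, class_a, class_b):
    """Edge weights for the bipartite graph of functional classes A and B."""
    weight = {}
    for a in class_a:
        for b in class_b:
            if a == b:
                # A gene in both classes appears on both sides, joined with weight 1.
                weight[a, b] = 1.0
            else:
                eps = scores.get(frozenset((a, b)), 0.0)  # assumed 0 if unmeasured
                weight[a, b] = min(1.0, abs(eps))          # |score|, capped at 1
    return weight


# 18 functional classes give C(18, 2) distinct pairs.
n_pairs = len(list(combinations(range(18), 2)))

# Hypothetical genes and scores.
scores = {frozenset(("G1", "G2")): -0.4, frozenset(("G1", "G3")): 1.7}
weight = build_weighted_graph(scores, ["G1", "G3"], ["G2", "G3"])
print(n_pairs, weight)
```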
Biological interpretation and examples 
Genes with high degree and strong links dominate the
results. In several instances, the quasi-bicliques are
trivial in the sense that only one gene is present in $U'$, and
it is linked to more than 20 genes in $V'$. Such
quasi-bicliques are maximal by definition but provide limited
insight. A minimum of m = 1 genes per subset was
included as an additional constraint to the LP model. It
might be sensible to implement such restrictions in the
application in general. 

We observed the following with the maximum
weighted α,β-WQB_P's in the data sets. Given the low
overall weight, the α,β-WQB_P's generated with the
parameters α = β = 0.1 were the most revealing. Though we
found many interesting quasi-bicliques in the 153 bipartite
graphs, we only present a couple of them here. A
notable latent set identified genes
involved in amino acid biosynthesis (SER2, THR4,
HOM6, URE2), which were found to form a 4 x 10 maximum
weighted quasi-biclique with genes coding for proteins
of the translation machinery, elongation factors in
particular (ELP2, ELP3, ELP4, ELP6, STP1, YPL102C,
DEG1, RPL35A, IKI3, RPP1A). These connections, to
our knowledge, have not been described, and one might
speculate that this is one way in which translation is coupled
to amino acid biosynthesis. In some cases the maximum
weighted quasi-biclique is centered around genes
that are annotated in more than one functional class, as
they provide strong weights. These genes are involved in
mitochondria-to-nucleus signaling and are examples
where our approach recovers known facts. Using the
query approach, it is possible to obtain quasi-bicliques
around a gene set of interest quickly and to extend the
approach to proteins of interest. 

Maximum weighted α,β-WQB_C's generated from the
data sets with parameters α = β = 5 reveal the following.
Genetic interaction networks allow the study of
protein-coding genes as well as genes that might only code for
RNAs. A noteworthy example was discovered in the
comparison of genes involved in nuclear transport and
those with an unknown bioprocess, which revealed proteins
that are part of the nuclear pore transport (POM34,
NUP60, NUP157, THP2 and POM152). They interact
with a number of genes that are lined up on chromosome
15 (YML033W, YML034C (SRC1) and YML035C-A)
as well as YDR431W. Most of the genes they
interact with are annotated as "dubious" in the current
version of the Yeast Genome Database SGD [17]. SRC1
overlaps with another uncharacterized gene, YML034C-A.
It is possible that this locus codes for a long RNA
involved in nuclear transport. 
Recovering missing edges 

The published data sets have edges under different 
thresholds removed. To sample such missing edges, we 
calculate the average weight of all the edges removed in 
the 153 bipartite graphs (generated above), and the cal- 
culated average weight is 0.0522. 

For each of the 153 maximum weighted quasi-bicliques
of either version, the missing edges induced by the
quasi-bicliques are then identified, and the average missing
edge weight e of each is calculated. The average missing
edge weight e is always greater than 0.0522. In other
words, we observe that a missing edge in a maximum
weighted quasi-biclique has a higher expected weight
than the weight of a randomly selected missing edge.
This happens when the α and β values are chosen to
derive α,β-WQB's which are denser in terms of
weight. 
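
One way to read this measurement is sketched below. It assumes that a "missing edge induced by a quasi-biclique" is a vertex pair inside U' × V' that is absent from the filtered input, and that its weight is looked up in the raw (unfiltered) data. Both the interpretation and all names here are assumptions made for illustration.

```python
def average_missing_weight(filtered_edges, raw_weight, U_sub, V_sub):
    """Average raw weight of pairs in U' x V' absent from the filtered graph."""
    missing = [(u, v) for u in U_sub for v in V_sub
               if (u, v) not in filtered_edges]
    if not missing:
        return None  # the quasi-biclique induces no missing edge
    return sum(raw_weight.get(pair, 0.0) for pair in missing) / len(missing)


# Hypothetical quasi-biclique with two present and two removed edges.
filtered_edges = {("a", "x"), ("b", "y")}
raw_weight = {("a", "y"): 0.12, ("b", "x"): 0.04}
e = average_missing_weight(filtered_edges, raw_weight, {"a", "b"}, {"x", "y"})
print(e)
```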

We further compare e from our approach to e from the
δ-quasi-bicliques (δ-QB) described by Liu et al. [3]. All
quasi-bicliques (including exact δ-QB using our IP
formulation) used to induce the average missing edge weight e are: 

(1) D05/M1: δ-QB with δ = 0.5 and minimum node
size 1, i.e., m = 1. 

(2) D05/M2: δ-QB with δ = 0.5; m = 2. 

(3) AB/M2: α,β-WQB_P using the minimum average
edge weights found from D05/M2 as α and β; m = 2. 

(4) AX/M2: α,β-WQB_P where X = α = β ∈ {0.05, 0.1,
0.2, 0.3, 0.4, 0.5}; m = 2. 

(5) CX/M2: α,β-WQB_C where X = α = β ∈ {1, 2, 3, 4,
5}; m = 2. 

Comparing the averages of e from A005/M2 to A05/M2
(see Table 2), we see a steady increase. Since
α and β can be seen as expected edge weights of the
resulting QB, the change in e shows that QB's identify
sub-graphs of expected edge weights. However, we do
not see a similar pattern in the constant versions.
Overall, in this particular experimental data set, the removed
edge weights are at most 0.16; hence e can never
closely approach the parameter α no matter how
lenient the parameters are. 

Method 

Time complexity 

Here we prove the NP-hardness of the α,β-WQB_P(C)
problem by a reduction from the maximum edge biclique
problem. Note that the Query_P(C) problem is a
generalization of the α,β-WQB_P(C) problem and hence is also
NP-hard. 

Lemma 1. The α,β-WQB_P(C) problem is NP-hard. 

Proof. Given a bipartite graph $G := (U + V, E)$ and an
integer $k$, the maximum edge biclique problem asks if $G$
contains a biclique with at least $k$ edges. The maximum
edge biclique problem is NP-complete [10]. Let
$G' := (U + V, E', \omega)$ be a weighted bipartite graph where $\omega(u, v)$
is set to 1 if $\{u, v\} \in E$ and to 0 otherwise. Note
that there is a biclique with $k$ edges in $G$ if and only if
the maximum weighted α,β-WQB_P in $G'$ has a weight of
at least $k$ when α and β are set to 1. Similarly, there is a
biclique with $k$ edges in $G$ if and only if the maximum
weighted α,β-WQB_C in $G'$ has a weight of at least $k$
when α and β are set to 0. Therefore, the α,β-WQB_P(C)
problem is NP-hard. 
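
The correspondence used in this proof can be checked exhaustively on a small instance. The sketch below builds the 0/1 weight function and confirms that, for every non-empty pair, being a biclique coincides with satisfying the percentage version at α = β = 1 and the constant version at α = β = 0. All function names and the example graph are illustrative.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a tuple."""
    for r in range(1, len(items) + 1):
        yield from combinations(items, r)


def to_weighted(U, V, edges):
    """Weight 1 for edges of the unweighted graph, 0 for non-edges."""
    return {(u, v): 1.0 if (u, v) in edges else 0.0 for u in U for v in V}


def is_biclique(edges, U_sub, V_sub):
    return all((u, v) in edges for u in U_sub for v in V_sub)


def is_wqb_p(w, U_sub, V_sub, alpha, beta):
    return (all(sum(w[u, v] for v in V_sub) >= alpha * len(V_sub) for u in U_sub)
            and all(sum(w[u, v] for u in U_sub) >= beta * len(U_sub) for v in V_sub))


def is_wqb_c(w, U_sub, V_sub, alpha, beta):
    return (all(sum(w[u, v] for v in V_sub) >= len(V_sub) - alpha for u in U_sub)
            and all(sum(w[u, v] for u in U_sub) >= len(U_sub) - beta for v in V_sub))


def reduction_agrees(U, V, edges):
    """True if bicliques and the two WQB versions coincide on every pair."""
    w = to_weighted(U, V, edges)
    for U_sub in nonempty_subsets(U):
        for V_sub in nonempty_subsets(V):
            b = is_biclique(edges, U_sub, V_sub)
            if b != is_wqb_p(w, U_sub, V_sub, 1, 1) or b != is_wqb_c(w, U_sub, V_sub, 0, 0):
                return False
    return True


U, V = ["a", "b", "c"], ["x", "y"]
edges = {("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "x")}
print(reduction_agrees(U, V, edges))
```

Because all weights are 0 or 1, the weight of such a quasi-biclique equals its number of edges, which is why the threshold k carries over unchanged.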

We now prove that checking for the existence of a
percentage version α,β-WQB in a bipartite graph is
NP-complete. Note that checking the existence of a constant
version α,β-WQB in a bipartite graph can be done in
polynomial time. For the rest of the section we only refer to
percentage version α,β-WQB's. 

Problem 3 (Existence). 

Instance: A weighted bipartite graph $G := (U + V, E, \omega)$,
values $\alpha, \beta \in [0, 1]$. 

Find: If there exists an α,β-WQB_P $(U', V')$ in $G$. 

To prove the hardness of the existence problem we need
some auxiliary definitions. A modified weighted bipartite
graph, denoted by $(U + V, E, \Omega)$, is a complete bipartite
graph $(U + V, E)$ with a weight function $\Omega : E \to \mathbb{R}$
where, for any two edges $e$ and $e'$, $|\Omega(e) - \Omega(e')| \le 1$. 

Definition 4 (Modified α,β-WQB (MO-WQB)). 

Let $G := (U + V, E, \Omega)$ be a modified weighted bipartite
graph. A non-empty pair $(U', V')$ included in $(U, V)$
is a MO-WQB of $G$ if it satisfies the three properties: (1)
$(U', V')$ includes $(\emptyset, V)$, (2) $\forall u \in U' : \sum_{v \in V'} \Omega(u, v) \ge 0$,
and (3) $\forall v \in V' : \sum_{u \in U'} \Omega(u, v) \ge 0$. 

Problem 4 (One sided existence). 

Instance: A weighted bipartite graph $G := (U + V, E, \omega)$,
values $\alpha, \beta \in [0, 1]$. 

Find: If there exists an α,β-WQB_P $(U', V')$ in $G$ which
includes the pair $(\emptyset, V)$. 

Problem 5 (Modified existence). 

Instance: A modified weighted bipartite graph
$G := (U + V, E, \Omega)$. 

Find: If there exists a MO-WQB in $G$. 

The series of reductions to prove the hardness of the 
existence problem are as follows. We first reduce the 
partition problem, which is NP-complete [18], to the 
modified existence problem. The modified existence pro- 
blem is then reduced to the one sided existence pro- 
blem. The one sided existence problem reduces to the 
existence problem. 

Table 2 Missing edge recovery in a genetic interaction network 

WQB_P    D05/M1  D05/M2  AB/M2   A005/M2  A01/M2  A02/M2  A03/M2  A04/M2  A05/M2
avg(e)   0.0855  0.0844  0.0850  0.0806   0.0830  0.0867  0.0905  0.0934  0.1169

WQB_C    C1/M2   C2/M2   C3/M2   C4/M2    C5/M2
avg(e)   0.1008  0.0805  0.0809  0.0823   0.0825

A comparison of e under various QB parameters showing improvements of recovered edge weight expectation in α,β-WQB's. 

Lemma 2. The modified existence problem is
NP-complete. 

Proof. The proof of MO-WQB $\in$ NP can be briefly
described as follows. 

Given a weighted bipartite graph $G := (U + V, E, \Omega)$ and
a pair $(U', V')$ included in $(U, V)$, it can be verified in
polynomial time if the pair $(U', V')$ satisfies all the
MO-WQB constraints for $G$. So, the modified existence
problem belongs to the class NP. The reduction from the
partition problem is as follows. 

We are left to show that partition $\le_p$ MO-WQB. Given
a finite set $A$ and a size $s(a) \in \mathbb{Z}^+$ associated with every
element $a$ of $A$, the partition problem asks if $A$ can be
partitioned into two sets $(A_1, A_2)$ such that
$\sum_{a \in A_1} s(a) = \sum_{a \in A_2} s(a)$. 

a. Construction: Let SUM be the sum of sizes of all
elements in $A$. Build a modified weighted bipartite
graph $G := (U + V, E, \Omega)$ as follows. For every
element $a$ in $A$ there is a corresponding vertex $u_a$ in $U$.
The set $V$ contains two vertices $v_+$ and $v_-$. For every
vertex $u_a \in U$, $\Omega(u_a, v_+) = s(a)/(2 \times SUM)$ and
$\Omega(u_a, v_-) = -s(a)/(2 \times SUM)$. Add an additional
vertex $u_{sum}$ to set $U$. Set $\Omega(u_{sum}, v_+)$ to $-1/4$ and
$\Omega(u_{sum}, v_-)$ to $1/4$. Note that the weights assigned to
edges of $G$ satisfy the constraint on $\Omega$ for a modified
weighted bipartite graph. 

b. $\Rightarrow$: Let $(A_1, A_2)$ be a partition of $A$ such that the
sum of the sizes of elements in $A_1$ is equal to the
sum of the sizes of elements in $A_2$. Let $U_1 = \{u_a : a \in A_1\}$.
The sum of weights of all edges from $v_+$ to
the vertices in $U_1$ is equal to $1/4$. Let $U' = U_1 \cup \{u_{sum}\}$.
The sum of weights of all edges from $v_+$ to
vertices of $U'$ is 0. Similarly, the sum of weights of
all edges from $v_-$ to vertices of $U'$ is 0. Thus, $(U', V)$
is a MO-WQB of $G$. 

$\Leftarrow$: Let $(P, V)$ be a MO-WQB of $G$. The edge
from $v_-$ to $u_{sum}$ is the only positive weighted
edge from vertex $v_-$. So, $P$ will contain vertex $u_{sum}$.
Since $\Omega(u_{sum}, v_+)$ is negative, set $P$ will also
contain vertices from $U - u_{sum}$. The sum of the
weights of edges from $v_-$ to vertices in $P - u_{sum}$
cannot be smaller than $-1/4$. Similarly, the sum
of the weights of edges from $v_+$ to vertices in
$P - u_{sum}$ cannot be smaller than $1/4$. So, the sum of
all elements in $A$ corresponding to the vertices
in $P - u_{sum}$ should be equal to SUM/2. This
proves that if $G$ contains a MO-WQB, set $A$ can
be partitioned. 

Hence, the modified existence problem is NP- 
complete. 
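
The construction in this proof can be replayed on small partition instances. The sketch below uses exact rational arithmetic and a brute-force search over subsets of U, which is feasible only for tiny inputs. The vertex naming scheme (`"u_" + element`, `"u_sum"`, `"v+"`, `"v-"`) and the function names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

V = ("v+", "v-")


def build_modified_graph(sizes):
    """Weight function Omega for a partition instance {element name: size}.

    Element names are assumed to be strings other than "sum".
    """
    total = sum(sizes.values())  # SUM in the construction
    omega = {}
    for a, s in sizes.items():
        omega["u_" + a, "v+"] = Fraction(s, 2 * total)
        omega["u_" + a, "v-"] = Fraction(-s, 2 * total)
    omega["u_sum", "v+"] = Fraction(-1, 4)
    omega["u_sum", "v-"] = Fraction(1, 4)
    return omega


def is_mo_wqb(omega, U_sub):
    """Properties (2) and (3) of Definition 4 with V' = V."""
    if not U_sub:
        return False
    rows_ok = all(sum(omega[u, v] for v in V) >= 0 for u in U_sub)
    cols_ok = all(sum(omega[u, v] for u in U_sub) >= 0 for v in V)
    return rows_ok and cols_ok


def has_mo_wqb(omega):
    """Brute-force search over all non-empty subsets of U."""
    U = sorted({u for u, _ in omega})
    return any(is_mo_wqb(omega, combo)
               for r in range(1, len(U) + 1)
               for combo in combinations(U, r))


yes_instance = build_modified_graph({"a": 3, "b": 1, "c": 2})  # {a} vs {b, c}
no_instance = build_modified_graph({"a": 1, "b": 2})           # odd total size
print(has_mo_wqb(yes_instance), has_mo_wqb(no_instance))
print(is_mo_wqb(yes_instance, ("u_a", "u_sum")))
```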

Lemma 3. The one sided existence problem is
NP-complete. 

Proof. The proof of one sided existence $\in$ NP is
omitted for brevity. Next we show MO-WQB $\le_p$ one
sided existence, i.e., we prove this problem to be
NP-complete by a reduction from the modified existence
problem. The reduction is as follows. 

a. Construction: Let $G := (U + V, E, \Omega)$ be the modified
weighted bipartite graph in an instance of the
modified existence problem. We build a graph
$G' := (U + V, E, \omega)$ for an instance of the one sided existence
problem from $G$. Notice that the partition and vertices
remain the same. If the weight of every edge in $G$
is non-negative, set $\alpha = \beta = 0$ and $\omega(u, v) = \Omega(u, v)$ for
every edge $\{u, v\} \in E$. Otherwise, set α and β to $|x|$
and $\omega(u, v) = \Omega(u, v) - x$ for every edge $\{u, v\} \in E$,
where $x$ is the minimum edge weight in $G$. 

b. $\Rightarrow$ and $\Leftarrow$: Let $(U', V')$ be a MO-WQB of graph $G$. If
the weights of all edges in $G$ are non-negative, the
constraints for both the problems are the same. If $G$ has
negative weighted edges, the constraints of both the
problems will be the same when α, β and ω for the one
sided existence problem instance are set as mentioned
in the construction. It can be seen that there is a
MO-WQB in $G$ if and only if there is an α,β-WQB_P in the
graph $G'$ which includes the pair $(\emptyset, V)$. 

This proves that the one sided existence problem is
NP-complete. 
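
The claim in step b that both sets of constraints coincide can be spelled out for the case in which $G$ has negative weights, so that $x < 0$ and $\alpha = \beta = |x| = -x$. The following short derivation is offered as a reading aid; it handles the condition on $u \in U'$, and the condition on $v \in V'$ follows symmetrically with $\beta$ and $|U'|$.

```latex
\begin{align*}
\sum_{v \in V'} \omega(u, v) \ge \alpha |V'|
  &\iff \sum_{v \in V'} \bigl(\Omega(u, v) - x\bigr) \ge |x|\,|V'| \\
  &\iff \sum_{v \in V'} \Omega(u, v) + |x|\,|V'| \ge |x|\,|V'| \\
  &\iff \sum_{v \in V'} \Omega(u, v) \ge 0 .
\end{align*}
```

The shifted weights also form a valid weight function: $\omega(u, v) = \Omega(u, v) - x \ge 0$ because $x$ is the minimum edge weight, and $\omega(u, v) \le 1$ because any two weights of a modified weighted bipartite graph differ by at most 1.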

Lemma 4. The existence problem is NP-complete. 

Proof. Given a pair $(U', V')$, a weighted
bipartite graph $G = (U + V, E, \omega)$ and values $\alpha, \beta \in [0, 1]$,
it can be verified in polynomial time if $(U', V')$ is an
α,β-WQB_P in $G$. Thus, the existence problem belongs to
NP. We now show that one sided existence $\le_p$ existence. 

a. Construction: Let $G' = (U + V, E', \omega)$ and $\alpha', \beta' \in [0, 1]$
be the parameters of the one sided existence
problem. We build the weighted bipartite graph
$G = (U_p + V, E, \omega)$ for the instance of the existence problem
as follows. First, set $G = G'$. For every vertex $u \in U$,
let $S_u$ denote the sum of the weights of all edges
incident on $u$. Delete every vertex $u \in U$ whose $S_u$ is
less than $\alpha' |V|$. Let $(U_p + V)$ denote the remaining
vertices, and $E$ the remaining edges in $G$.
For the instance of the existence problem, set $\alpha = 0$
and $\beta = \beta'$. 

b. $\Rightarrow$ and $\Leftarrow$: Any α,β-WQB_P in $G'$ which includes
$(\emptyset, V)$ is also an α,β-WQB_P in $G$. Consider an α,β-WQB_P
$(U', V')$ in $G$. If $V' = V$, then $(U', V')$ is an α,β-WQB_P
in $G'$ which includes the pair $(\emptyset, V)$. If $V'$ is not the
same set as $V$, the pair $(U', V)$ is still an α,β-WQB_P in
$G'$ and it includes the pair $(\emptyset, V)$. 

IP formulations for the α,β-WQB problem 

Although greedy approaches are often used in problems
of a similar structure, e.g., multi-dimensional knapsack
[19] and δ-QB [3], in our experiments both greedy and
randomized approaches did not identify solutions close
enough to the exact solutions. Simple greedy and
randomized solutions yielded accuracies
ranging from 60% to 95% depending on various
parameters, without performance guarantee. Hence we
consider it important here to find exact
solutions in order to demonstrate the usefulness of
α,β-WQB's. Here we present integer programming (IP)
formulations that solve the α,β-WQB problem exactly. 

Due to the similarity of the constraints
for $\alpha,\beta$-WQB$_C$ and $\alpha,\beta$-WQB$_P$, we start by
formulating a solution to $\alpha,\beta$-WQB$_P$. Our initial IP requires
quadratic constraints, which are then replaced by linear
constraints so that it can be solved by various
optimization software packages. Our final formulation is further
improved by adopting the implication rule to reduce
the variables involved. This improved formulation requires
a number of variables and constraints linear in the number of input
edges and is thus better suited to sparse graphs.
Throughout the section, unless stated otherwise,
$G := (U + V, E, \omega)$ represents a weighted bipartite graph,
$G' = (U', V')$ represents the maximum weighted $\alpha,\beta$-WQB of $G$,
and $E'$ represents the edges induced by $G'$ in $G$.

Quadratic programming 

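For each $u \in U$ ($v \in V$), a binary variable $x_u$ ($x_v$) is
introduced. The variable $x_u$ ($x_v$) is 1 if and only if vertex
$u$ ($v$) is in $U'$ ($V'$). The integer program to find the
solution $G'$ can be formulated as follows.

$$\text{Binary variables: } x_u, \text{ s.t. } x_u = 1 \text{ iff } u \in U', \text{ for each } u \in U \tag{1}$$

$$x_v, \text{ s.t. } x_v = 1 \text{ iff } v \in V', \text{ for each } v \in V \tag{2}$$

$$\text{Subject to: } \sum_{v \in V} \omega(u, v) \cdot x_u x_v \geq \alpha \sum_{v \in V} x_u x_v \quad \text{for all } u \in U \tag{3}$$

$$\sum_{u \in U} \omega(u, v) \cdot x_u x_v \geq \beta \sum_{u \in U} x_u x_v \quad \text{for all } v \in V \tag{4}$$

$$\text{Maximize: } \sum_{(u, v) \in U \times V} \omega(u, v) \cdot x_u x_v \tag{5}$$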

The quadratic terms in the constraints are necessary
because the $\alpha$ and $\beta$ thresholds apply only to vertices in $U'$
and $V'$. This formulation uses a number of variables and constraints
linear in the number of input vertices, i.e., $O(|U| + |V|)$.
Since solving a quadratic program usually requires a
proprietary solver, we reformulate the program so that
all expressions are linear.

Converted linear programming 

A standard approach to convert a quadratic program to a
linear one is to introduce auxiliary variables that replace the
quadratic terms. Here we introduce a binary variable $y_{uv}$
for every edge $(u, v)$ in $G$, such that $y_{uv} = 1$ if and only if
$x_u = x_v = 1$, i.e., the edge $(u, v)$ is in $G'$. The linear program
to find the solution $G'$ is formulated as follows.



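Binary variables: Same as in (1) and (2)

$$y_{uv}, \text{ s.t. } y_{uv} = 1 \text{ iff } x_u = x_v = 1, \text{ for all } (u, v) \in E \tag{6}$$

$$\text{Subject to: } y_{uv} \leq (x_u + x_v)/2 \quad \text{for all } (u, v) \in E \tag{7}$$

$$y_{uv} \geq x_u + x_v - 1 \quad \text{for all } (u, v) \in E \tag{8}$$

$$\sum_{v \in V} \omega(u, v) \cdot y_{uv} \geq \alpha \sum_{v \in V} y_{uv} \quad \text{for all } u \in U \tag{9}$$

$$\sum_{u \in U} \omega(u, v) \cdot y_{uv} \geq \beta \sum_{u \in U} y_{uv} \quad \text{for all } v \in V \tag{10}$$

$$\text{Maximize: } \sum_{(u, v) \in U \times V} \omega(u, v) \cdot y_{uv} \tag{11}$$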

Expressions (7) and (8) state the condition that $y_{uv} = 1$
if and only if $x_u = x_v = 1$. Expression (8) ensures that,
for any edge whose end points $(u, v)$ are chosen to be in
$G'$, $y_{uv}$ is set to 1. Due to the use of the $y_{uv}$ variables, this
formulation requires $O(|U||V|)$ variables and constraints.
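
The claim that (7) and (8) together force $y_{uv} = x_u x_v$ for binary variables can be checked by enumerating the four possible cases. The snippet below is a small sanity check rather than part of the formulation; the helper name `feasible_y` is hypothetical.

```python
from itertools import product


def feasible_y(x_u: int, x_v: int) -> list:
    """Return the binary values of y allowed by constraints (7) and (8)."""
    return [y for y in (0, 1)
            if y <= (x_u + x_v) / 2      # constraint (7)
            and y >= x_u + x_v - 1]      # constraint (8)


for x_u, x_v in product((0, 1), repeat=2):
    # Exactly one value of y survives, and it equals the product x_u * x_v.
    assert feasible_y(x_u, x_v) == [x_u * x_v]
    print(x_u, x_v, feasible_y(x_u, x_v))
```

Constraint (7) alone rules out $y_{uv} = 1$ unless both end points are selected, and constraint (8) alone rules out $y_{uv} = 0$ when both are selected.
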
Improved linear programming 

Observe that constraint (7) becomes trivial if $y_{uv} = 0$. In
other words, this constraint formulates implications, e.g.,
for binary variables $p$ and $q$, the expression $p \leq q$ is
equivalent to $p \rightarrow q$. Expanding on this idea, we eliminate
the variables $y_{uv}$ from constraints (9)
and (10) in the next formulation, while keeping the rest
of the aforementioned linear program.

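$$\text{Subject to: } \sum_{v \in V} (\omega(u, v) - \alpha)\, x_v \geq |V| (x_u - 1) \quad \text{for all } u \in U \tag{12}$$

$$\sum_{u \in U} (\omega(u, v) - \beta)\, x_u \geq |U| (x_v - 1) \quad \text{for all } v \in V \tag{13}$$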

There is a variable $x_v$ for every vertex $v$ in $G$. There is
a variable $y_{uv}$ for every edge $(u, v)$ in $G$ whose weight is
not 0. The variable $y_{uv}$ is set to 1 if and only if both $x_u$
and $x_v$ are set to 1. For any vertex $u \in U$ ($v \in V$), the
variable $x_u$ ($x_v$) is set to 1 if and only if vertex $u$ ($v$) is in
$G'$. Constraint (12) can also be explained as follows. If
$x_u = 1$, the constraint reduces to the second constraint
in the $\alpha,\beta$-WQB definition. If $x_u = 0$, constraint
(12) becomes trivial. Constraint (13) can be explained in
a similar manner.
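
To make constraints (12) and (13) concrete, the sketch below enumerates all binary assignments of a tiny instance and keeps the heaviest feasible one. It is a brute-force reference for illustration only, not the IP solver used in the paper, and it assumes weights in $[0, 1]$, distinct vertex names in $U$ and $V$, and weight 0 for absent pairs. The function name `solve_improved`, the tolerance `eps`, and the toy graph are hypothetical.

```python
from itertools import product


def solve_improved(U, V, omega, alpha, beta, eps=1e-9):
    """Return (best weight, U', V') by enumerating all binary assignments."""
    def w(u, v):
        return omega.get((u, v), 0.0)  # assumption: missing pair = weight 0

    best = (float("-inf"), (), ())
    for bits in product((0, 1), repeat=len(U) + len(V)):
        x = dict(zip(list(U) + list(V), bits))
        # Constraint (12), one inequality per u in U.
        ok_u = all(sum((w(u, v) - alpha) * x[v] for v in V)
                   >= len(V) * (x[u] - 1) - eps for u in U)
        # Constraint (13), one inequality per v in V.
        ok_v = all(sum((w(u, v) - beta) * x[u] for u in U)
                   >= len(U) * (x[v] - 1) - eps for v in V)
        if not (ok_u and ok_v):
            continue
        # Objective (11), with y_uv replaced by the product x_u * x_v.
        value = sum(w(u, v) * x[u] * x[v] for u in U for v in V)
        if value > best[0]:
            best = (value,
                    tuple(u for u in U if x[u]),
                    tuple(v for v in V if x[v]))
    return best


# Hypothetical toy instance.
U = ["u1", "u2", "u3"]
V = ["v1", "v2"]
omega = {("u1", "v1"): 1.0, ("u1", "v2"): 1.0,
         ("u2", "v1"): 1.0, ("u2", "v2"): 0.5,
         ("u3", "v1"): 1.0}
print(solve_improved(U, V, omega, alpha=0.75, beta=0.75))
# (3.5, ('u1', 'u2'), ('v1', 'v2'))
```

In this instance u3 is left out: with both v1 and v2 selected its weights sum to 1.0, below the required $0.75 \cdot 2 = 1.5$, so constraint (12) fails for $x_{u3} = 1$. The enumeration is exponential in $|U| + |V|$ and is only meant for checking small cases by hand.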

Generalized formulation for $\alpha,\beta$-WQB$_P$ and $\alpha,\beta$-WQB$_C$

Recall that the difference between the two problems
$\alpha,\beta$-WQB$_P$ and $\alpha,\beta$-WQB$_C$ lies in the edge weight
summation, which we can combine into the following properties:
(1) $\forall u \in U' : \sum_{v \in V'} \omega(u, v) \geq \alpha_P |V'| - \alpha_C$, and
(2) $\forall v \in V' : \sum_{u \in U'} \omega(u, v) \geq \beta_P |U'| - \beta_C$,
where $(\alpha_P, \beta_P)$ and $(\alpha_C, \beta_C)$ are the parameters given in
$\alpha,\beta$-WQB$_P$ and $\alpha,\beta$-WQB$_C$, respectively. Following the same
reasoning as in the previous paragraphs, the linear constraints (12) and (13)
are updated as follows.
$$\text{Subject to: } \sum_{v \in V} (\omega(u, v) - \alpha_P)\, x_v + \alpha_C \geq |V| (x_u - 1) \quad \text{for all } u \in U \tag{14}$$

$$\sum_{u \in U} (\omega(u, v) - \beta_P)\, x_u + \beta_C \geq |U| (x_v - 1) \quad \text{for all } v \in V \tag{15}$$

As a result, the problem instance is an $\alpha,\beta$-WQB$_C$ problem
if $(\alpha_P, \beta_P) = (1, 1)$, and it is an $\alpha,\beta$-WQB$_P$ problem
if $(\alpha_C, \beta_C) = (0, 0)$. Note that the formulation does not
require either condition to hold; it essentially defines
a generalization of the $\alpha,\beta$-WQB problems when all four
parameters are valid and non-zero.
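
The way the four parameters select between the two problem variants can be illustrated with a small feasibility check for constraints (14) and (15). This is a sketch under the same assumptions as before (absent pairs have weight 0); the function name `satisfies_generalized`, the argument names, and the toy weights are hypothetical.

```python
def satisfies_generalized(x, U, V, omega, a_p, a_c, b_p, b_c, eps=1e-9):
    """Check constraints (14) and (15) for a binary assignment x."""
    def w(u, v):
        return omega.get((u, v), 0.0)  # assumption: missing pair = weight 0

    # Constraint (14), one inequality per u in U.
    ok_u = all(sum((w(u, v) - a_p) * x[v] for v in V) + a_c
               >= len(V) * (x[u] - 1) - eps for u in U)
    # Constraint (15), one inequality per v in V.
    ok_v = all(sum((w(u, v) - b_p) * x[u] for u in U) + b_c
               >= len(U) * (x[v] - 1) - eps for v in V)
    return ok_u and ok_v


# Hypothetical toy instance with every vertex selected.
U = ["u1", "u2"]
V = ["v1", "v2"]
omega = {("u1", "v1"): 1.0, ("u1", "v2"): 1.0,
         ("u2", "v1"): 1.0, ("u2", "v2"): 0.5}
x = {"u1": 1, "u2": 1, "v1": 1, "v2": 1}

# Percentage version: (a_c, b_c) = (0, 0).
print(satisfies_generalized(x, U, V, omega, 0.75, 0.0, 0.75, 0.0))
# Constant version: (a_p, b_p) = (1, 1); up to 0.5 weight may be missing.
print(satisfies_generalized(x, U, V, omega, 1.0, 0.5, 1.0, 0.5))
# Constant version with a tighter allowance of 0.25.
print(satisfies_generalized(x, U, V, omega, 1.0, 0.25, 1.0, 0.25))
```

The first two calls accept the assignment. The third rejects it, because u2 is missing 0.5 of the full weight $|V'| = 2$, which exceeds the allowance of 0.25.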

If there are $n$ vertices in $U$ and $m$ vertices in $V$, there
will be a total of $m + n + 2k$ constraints and $m + n + k$
variables, where $k$ is the number of edges whose weight
is not equal to 0. The above formulations can be
extended to solve the query problem by adding an
additional constraint $x_v = 1$ to the formulation for every
vertex $v \in P \cup Q$. Similar constraints also help us
explore suboptimal solutions, e.g., by excluding known
vertices in subsequent solutions, or provide a lower
bound on the required query items in the optimal solution.
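
The effect of fixing variables, both for the query problem ($x_v = 1$) and for excluding known vertices ($x_v = 0$), can be sketched by filtering a brute-force enumeration over constraints (12) and (13). This is an illustrative stand-in for adding equality constraints to an IP model; the function name `solve_with_fixed`, the `fixed` mapping, and the toy graph are hypothetical, and absent pairs are assumed to have weight 0.

```python
from itertools import product


def solve_with_fixed(U, V, omega, alpha, beta, fixed, eps=1e-9):
    """Best (weight, U', V') among assignments agreeing with `fixed`."""
    def w(u, v):
        return omega.get((u, v), 0.0)  # assumption: missing pair = weight 0

    names = list(U) + list(V)
    best = (float("-inf"), (), ())
    for bits in product((0, 1), repeat=len(names)):
        x = dict(zip(names, bits))
        # Extra constraints x_v = 1 (query) or x_v = 0 (exclusion).
        if any(x[name] != value for name, value in fixed.items()):
            continue
        ok_u = all(sum((w(u, v) - alpha) * x[v] for v in V)
                   >= len(V) * (x[u] - 1) - eps for u in U)   # (12)
        ok_v = all(sum((w(u, v) - beta) * x[u] for u in U)
                   >= len(U) * (x[v] - 1) - eps for v in V)   # (13)
        if not (ok_u and ok_v):
            continue
        value = sum(w(u, v) * x[u] * x[v] for u in U for v in V)
        if value > best[0]:
            best = (value,
                    tuple(u for u in U if x[u]),
                    tuple(v for v in V if x[v]))
    return best


# Hypothetical toy instance.
U = ["u1", "u2", "u3"]
V = ["v1", "v2"]
omega = {("u1", "v1"): 1.0, ("u1", "v2"): 1.0,
         ("u2", "v1"): 1.0, ("u2", "v2"): 0.5,
         ("u3", "v1"): 1.0}
print(solve_with_fixed(U, V, omega, 0.75, 0.75, fixed={"u3": 1}))
# (3.0, ('u1', 'u2', 'u3'), ('v1',))
print(solve_with_fixed(U, V, omega, 0.75, 0.75, fixed={"u1": 0}))
# (2.0, ('u2', 'u3'), ('v1',))
```

Forcing u3 into the solution drops v2, since u3 has no weight towards it, and the optimum falls from 3.5 (no fixed vertices) to 3.0.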

Conclusions 

We address noise and incompleteness in biological net- 
works by introducing graph-theoretical optimization 
problems that identify variations of novel weighted 
quasi-bicliques. These quasi-biclique problems incorpo- 
rate biological interaction levels in different analytical 
settings and exhibit improvements over un-weighted 
quasi-bicliques. To meet demands of biologists we also 
provide a query version of (weighted) quasi-biclique 
problems. We prove that our problems are NP-hard, 
and describe IP formulations that can tackle moderately
sized problem instances. Simulations and empirical data
solved by our IP formulation suggest that our weighted 
quasi-biclique problems are applicable to various other 
biological networks. 

Future work will concentrate on the design of algo- 
rithms for solving large-scale instances of weighted 
quasi-biclique problems within guaranteed bounds. 
Greedy approaches may result in effective heuristics that 
can analyze ever-growing biological networks. A practi- 
cal extension to the query problem is the development 
of an efficient enumeration of all maximal $\alpha,\beta$-WQBs.